[SPARK-40710][DOCS] Supplement undocumented parquet configurations in documentation #38160
Conversation
dcoliversun left a comment
cc @HyukjinKwon @dongjoon-hyun
It would be great if you have time to review :)
    <td>1.3.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.int96TimestampConversion</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 899 to 905 in 309638e
  val PARQUET_INT96_TIMESTAMP_CONVERSION = buildConf("spark.sql.parquet.int96TimestampConversion")
    .doc("This controls whether timestamp adjustments should be applied to INT96 data when " +
      "converting to timestamps, for data written by Impala. This is necessary because Impala " +
      "stores INT96 data with a different timezone offset than Hive & Spark.")
    .version("2.3.0")
    .booleanConf
    .createWithDefault(false)
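As a usage sketch (the path below is hypothetical), the conversion can be enabled per session in Spark SQL before reading Impala-written data:

```sql
-- Apply Impala's INT96 timezone adjustment when reading (default: false)
SET spark.sql.parquet.int96TimestampConversion=true;

-- Hypothetical path to a Parquet table written by Impala
SELECT * FROM parquet.`/data/impala_written`;
```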
    <td>2.3.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.outputTimestampType</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 911 to 921 in 309638e
  val PARQUET_OUTPUT_TIMESTAMP_TYPE = buildConf("spark.sql.parquet.outputTimestampType")
    .doc("Sets which Parquet timestamp type to use when Spark writes data to Parquet files. " +
      "INT96 is a non-standard but commonly used timestamp type in Parquet. TIMESTAMP_MICROS " +
      "is a standard timestamp type in Parquet, which stores number of microseconds from the " +
      "Unix epoch. TIMESTAMP_MILLIS is also standard, but with millisecond precision, which " +
      "means Spark has to truncate the microsecond portion of its timestamp value.")
    .version("2.3.0")
    .stringConf
    .transform(_.toUpperCase(Locale.ROOT))
    .checkValues(ParquetOutputTimestampType.values.map(_.toString))
    .createWithDefault(ParquetOutputTimestampType.INT96.toString)
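A brief usage sketch, following the doc string above (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS); note the value is upper-cased by the conf, so matching is effectively case-insensitive:

```sql
-- Write standard microsecond-precision timestamps instead of the INT96 default
SET spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS;
```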
    <td>1.2.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.aggregatePushdown</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1003 to 1010 in 309638e
  val PARQUET_AGGREGATE_PUSHDOWN_ENABLED = buildConf("spark.sql.parquet.aggregatePushdown")
    .doc("If true, aggregates will be pushed down to Parquet for optimization. Support MIN, MAX " +
      "and COUNT as aggregate expression. For MIN/MAX, support boolean, integer, float and date " +
      "type. For COUNT, support all data types. If statistics is missing from any Parquet file " +
      "footer, exception would be thrown.")
    .version("3.3.0")
    .booleanConf
    .createWithDefault(false)
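For illustration (table path and column name are hypothetical), a query consisting only of MIN/MAX/COUNT aggregates can be served from Parquet footer statistics once the flag is on:

```sql
SET spark.sql.parquet.aggregatePushdown=true;

-- MIN/MAX on supported types and COUNT can be answered from footer statistics
SELECT MIN(ss_quantity), MAX(ss_quantity), COUNT(*) FROM parquet.`/data/store_sales`;
```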
    <td>1.5.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.respectSummaryFiles</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 872 to 879 in 309638e
  val PARQUET_SCHEMA_RESPECT_SUMMARIES = buildConf("spark.sql.parquet.respectSummaryFiles")
    .doc("When true, we make assumption that all part-files of Parquet are consistent with " +
      "summary files and we will ignore them when merging schema. Otherwise, if this is " +
      "false, which is the default, we will merge all part-files. This should be considered " +
      "as expert-only option, and shouldn't be enabled before knowing what it means exactly.")
    .version("1.5.0")
    .booleanConf
    .createWithDefault(false)
    <td>1.6.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.enableVectorizedReader</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1033 to 1038 in 309638e
  val PARQUET_VECTORIZED_READER_ENABLED =
    buildConf("spark.sql.parquet.enableVectorizedReader")
      .doc("Enables vectorized parquet decoding.")
      .version("2.0.0")
      .booleanConf
      .createWithDefault(true)
    <td>2.3.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.columnarReaderBatchSize</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1058 to 1063 in 309638e
  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.columnarReaderBatchSize")
    .doc("The number of rows to include in a parquet vectorized reader batch. The number should " +
      "be carefully chosen to minimize overhead and avoid OOMs in reading data.")
    .version("2.4.0")
    .intConf
    .createWithDefault(4096)
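A minimal sketch combining the two vectorized-reader settings above; the batch size shown is only an example of trading memory for throughput, not a recommendation:

```sql
SET spark.sql.parquet.enableVectorizedReader=true;  -- default: true
SET spark.sql.parquet.columnarReaderBatchSize=2048; -- default: 4096; lower to reduce memory pressure
```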
    <td>2.4.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.fieldId.write.enabled</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1065 to 1072 in 309638e
  val PARQUET_FIELD_ID_WRITE_ENABLED =
    buildConf("spark.sql.parquet.fieldId.write.enabled")
      .doc("Field ID is a native field of the Parquet schema spec. When enabled, " +
        "Parquet writers will populate the field Id " +
        "metadata (if present) in the Spark schema to the Parquet schema.")
      .version("3.3.0")
      .booleanConf
      .createWithDefault(true)
    <td>3.3.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.fieldId.read.enabled</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1074 to 1081 in 309638e
  val PARQUET_FIELD_ID_READ_ENABLED =
    buildConf("spark.sql.parquet.fieldId.read.enabled")
      .doc("Field ID is a native field of the Parquet schema spec. When enabled, Parquet readers " +
        "will use field IDs (if present) in the requested Spark schema to look up Parquet " +
        "fields instead of using column names")
      .version("3.3.0")
      .booleanConf
      .createWithDefault(false)
    <td>3.3.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.fieldId.read.ignoreMissing</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1083 to 1090 in 309638e
  val IGNORE_MISSING_PARQUET_FIELD_ID =
    buildConf("spark.sql.parquet.fieldId.read.ignoreMissing")
      .doc("When the Parquet file doesn't have any field IDs but the " +
        "Spark read schema is using field IDs to read, we will silently return nulls " +
        "when this flag is enabled, or error otherwise.")
      .version("3.3.0")
      .booleanConf
      .createWithDefault(false)
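Taken together, the three field-ID settings above can be sketched as follows, with the defaults noted per the snippets:

```sql
SET spark.sql.parquet.fieldId.write.enabled=true;      -- default: true; write field IDs to Parquet schema
SET spark.sql.parquet.fieldId.read.enabled=true;       -- default: false; match columns by field ID, not name
SET spark.sql.parquet.fieldId.read.ignoreMissing=true; -- default: false; return nulls instead of erroring
```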
    <td>3.3.0</td>
  </tr>
  <tr>
    <td><code>spark.sql.parquet.timestampNTZ.enabled</code></td>
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1092 to 1101 in 309638e
  val PARQUET_TIMESTAMP_NTZ_ENABLED =
    buildConf("spark.sql.parquet.timestampNTZ.enabled")
      .doc(s"Enables ${TimestampTypes.TIMESTAMP_NTZ} support for Parquet reads and writes. " +
        s"When enabled, ${TimestampTypes.TIMESTAMP_NTZ} values are written as Parquet timestamp " +
        "columns with annotation isAdjustedToUTC = false and are inferred in a similar way. " +
        s"When disabled, such values are read as ${TimestampTypes.TIMESTAMP_LTZ} and have to be " +
        s"converted to ${TimestampTypes.TIMESTAMP_LTZ} for writes.")
      .version("3.4.0")
      .booleanConf
      .createWithDefault(true)
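As a usage sketch, disabling the flag falls back to the behavior described in the doc string above:

```sql
-- Read isAdjustedToUTC = false timestamp columns as TIMESTAMP_LTZ instead of TIMESTAMP_NTZ
SET spark.sql.parquet.timestampNTZ.enabled=false;
```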
srowen left a comment
I think these are OK. I don't think any are meant to be hidden or internal-only.
Are these logically ordered?
@srowen Yes, related configurations are grouped together and appear in a logical order relative to each other.
Can one of the admins verify this patch?
What changes were proposed in this pull request?
This PR supplements undocumented Parquet configurations in the documentation.
Why are the changes needed?
Helps users look up configurations in the documentation instead of reading the source code.
Does this PR introduce any user-facing change?
Yes, more configurations appear in the documentation.
How was this patch tested?
Passes the existing GitHub Actions (GA) checks.